[historyserver][collector] Add file-level idempotency check for prev-logs processing on container restart #4321
base: master
Conversation
```go
// isFileAlreadyPersisted checks if a file has already been persisted to persist-complete-logs
```
isFileAlreadyPersisted detects files already moved to persist-complete-logs and avoids duplicate uploads.
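For reference, a minimal sketch of what such a check could look like, assuming the collector struct exposes the two directories shown later in this diff (the struct and field names here are placeholders, not the actual implementation):

```go
package logcollector

import (
	"os"
	"path/filepath"
	"strings"
)

// collector holds the two directories quoted later in this diff;
// names here are placeholders.
type collector struct {
	prevLogsDir            string // e.g. /tmp/ray/prev-logs
	persistCompleteLogsDir string // e.g. /tmp/ray/persist-complete-logs
}

// isFileAlreadyPersisted reports whether a file under prev-logs already has a
// counterpart at the same relative path under persist-complete-logs, meaning
// it was moved there after a successful upload and can be skipped.
func (c *collector) isFileAlreadyPersisted(path string) bool {
	rel := strings.TrimPrefix(path, c.prevLogsDir)
	_, err := os.Stat(filepath.Join(c.persistCompleteLogsDir, rel))
	return err == nil
}
```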
Future-Outlier left a comment:
Hi, thank you for this PR!
Do you mind providing a reproduction script like this so I can test it in my env more easily?
Hi @my-vegetable-has-exploded, thanks for your contribution. Would you mind modifying the PR title and description to clarify that this feature is actually for container restart, not pod restart? I believe this works only for cases in which the collector sidecar fails and recovers. For now, the data is permanently erased if the head pod restarts, since we use an emptyDir volume. If I have misunderstood anything, please correct me. Thanks!
```go
r.prevLogsDir = "/tmp/ray/prev-logs"
r.persistCompleteLogsDir = "/tmp/ray/persist-complete-logs"
```
Could we move these to constants in historyserver/pkg/utils/utils.go, similar to KunWuLuan@9b2ca52?
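For illustration, the suggested change might look roughly like this (the constant names are hypothetical):

```go
// historyserver/pkg/utils/utils.go
package utils

const (
	PrevLogsDir            = "/tmp/ray/prev-logs"
	PersistCompleteLogsDir = "/tmp/ray/persist-complete-logs"
)
```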
Hi @justinyeh1995,
Thanks for pointing this out. I’m currently working on it (handling all constants at once) and will open a PR soon. I’m not sure whether it should be included here or handled in a separate PR.
No problem. I wasn't aware of that when I wrote the comment. Is the PR related to any issue, though? I think I need to catch up quite a bit.
You can refer to this issue. Thanks a lot!
Thanks for the heads up!
Future-Outlier left a comment:
Please also add an e2e test in this PR.
Got it! I'll add it tonight.
Future-Outlier left a comment:
cc @lorriexingfang to review
nit: consider using filepath functions to handle file paths.
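For example, a tiny standalone comparison of the two approaches (the paths here are just illustrative):

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	sessionID, nodeID := "session_2024", "node_abc"

	// Naive concatenation can produce doubled separators:
	concat := "/tmp/ray/prev-logs/" + "/" + sessionID + "/" + nodeID
	fmt.Println(concat) // /tmp/ray/prev-logs//session_2024/node_abc

	// filepath.Join normalizes separators and cleans the result:
	joined := filepath.Join("/tmp/ray/prev-logs", sessionID, nodeID)
	fmt.Println(joined) // /tmp/ray/prev-logs/session_2024/node_abc
}
```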
Thanks for the review. There are a few similar issues in this file. Maybe we can file a new ticket to handle them?
Agreed to handle it as a follow-up.
Hi @my-vegetable-has-exploded,
do you mind creating an issue so that I can assign it to others?
no problem.
Future-Outlier left a comment:
cursor review
bugbot run
cursor review
cc @JiangJiaWei1103 @CheyuWu @seanlaii to test this PR locally before you approve this, thank you!
```go
// 2. Inject "leftover" logs into prev-logs via the ray-head container while collector is down.
// Note: We only inject logs, not node_events, because the collector's prev-logs processing
// currently only handles the logs directory. node_events are handled by the EventServer separately.
```
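For context, a hypothetical sketch of how step 2 could be done from the e2e test; the helper name, container name, and paths are assumptions, not the actual test code:

```go
package e2e

import (
	"fmt"
	"os/exec"
)

// injectLeftoverLogs writes a fake leftover log file into prev-logs via the
// ray-head container while the collector container is down or restarting.
func injectLeftoverLogs(headPodName, sessionID, nodeID string) error {
	logPath := fmt.Sprintf("/tmp/ray/prev-logs/%s/%s/logs/leftover.log", sessionID, nodeID)
	script := fmt.Sprintf("mkdir -p $(dirname %s) && echo leftover > %s", logPath, logPath)
	return exec.Command("kubectl", "exec", headPodName, "-c", "ray-head", "--",
		"sh", "-c", script).Run()
}
```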
Just curious. How can we ensure that the injection is completed during the collector downtime?
Because Kubernetes restarts the collector immediately after kill 1, I can't absolutely guarantee the process stays down until the injection finishes. But the injection is short-lived, so the current approach works well in my local environment.
If you have a better way to handle this, I'd be glad to implement it.
```go
	return
}

// Check if this directory has already been processed by checking in persist-complete-logs
```
No changes are needed here. This note is just to clarify that we still need to check for leftover log files, since the presence of log directories under the persist-complete-logs directory alone does not guarantee that all logs have been fully persisted.
Thanks for the review! If the Collector crashes halfway through a node directory, persist-complete-logs already exists but some logs remain in prev-logs. Keeping this check would cause the Collector to skip the entire directory upon restart, leading to data loss. The current file-level check isFileAlreadyPersisted handles this "partial success" scenario correctly without duplicate uploads.
Replace it with the following code; then you can trigger this case using `cd historyserver && go test -v ./pkg/collector/logcollector/runtime/logcollector/ -run TestScanAndProcess`. @JiangJiaWei1103
```go
completeDir := filepath.Join(r.persistCompleteLogsDir, sessionID, nodeID, "logs")
if _, err := os.Stat(completeDir); err == nil {
	logrus.Infof("Session %s node %s logs already processed, skipping", sessionID, nodeID)
	return
}
```
And the e2e test also covers this case.
Wow, thanks for the detailed explanation! I originally just wanted to leave a brief note so maintainers could see why we removed the code snippet here. All tests pass in my local env. Thanks!
```go
}

// Walk through the logs directory and process all files
err := filepath.WalkDir(logsDir, func(path string, info fs.DirEntry, err error) error {
```
When the collector restarts, WatchPrevLogsLoops calls processPrevLogsDir, so we can process leftover log files in prev-logs/{sessionID}/{nodeID}/logs/ directories.
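As a rough sketch of that flow, with all names below being placeholders rather than the actual implementation:

```go
package logcollector

// prevLogEvent stands in for whatever the real watcher emits.
type prevLogEvent struct{ SessionID, NodeID string }

type collector struct {
	events <-chan prevLogEvent
}

// scanAndProcessPrevLogs walks prev-logs/{sessionID}/{nodeID}/logs at startup
// and re-processes any files left over from a previous run.
func (c *collector) scanAndProcessPrevLogs() { /* initial scan */ }

// processPrevLogsDir uploads the logs for one session/node directory.
func (c *collector) processPrevLogsDir(sessionID, nodeID string) { /* upload */ }

// WatchPrevLogsLoops first replays leftovers (recovery after a container
// restart), then watches for new entries and processes them as they appear.
func (c *collector) WatchPrevLogsLoops() {
	c.scanAndProcessPrevLogs()
	for ev := range c.events {
		c.processPrevLogsDir(ev.SessionID, ev.NodeID)
	}
}
```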
## Troubleshooting

### "too many open files" error

If you encounter `level=fatal msg="Create fsnotify NewWatcher error too many open files"` in the collector logs, it is likely due to the inotify limits on the Kubernetes nodes.

To fix this, increase the limits on the **host nodes** (not inside the container):

```bash
# Apply changes immediately
sudo sysctl -w fs.inotify.max_user_instances=8192
sudo sysctl -w fs.inotify.max_user_watches=524288
```

To make these changes persistent across reboots, use the following commands:

```bash
echo "fs.inotify.max_user_instances=8192" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_watches=524288" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
Very cool, I've never heard of this. I learned something!
Future-Outlier left a comment:
LGTM, just revise it, thank you!
cc @rueian to merge
Why are these changes needed?
Log persistence can fail if the collector container crashes or restarts. To ensure reliability, the collector must resume processing any logs left in the prev-logs directory upon recovery.
In this PR, we enhanced the WatchPrevLogsLoops() function to perform an initial scan of the prev-logs directory at startup, before entering the watch loop. This ensures that any session or node folders left over from previous runs (e.g., after a container restart) are correctly processed and uploaded.
Related issue number
Closes #4281
Checks